153 research outputs found
Affine Invariant Covariance Estimation for Heavy-Tailed Distributions
In this work we provide an estimator for the covariance matrix of a
heavy-tailed multivariate distribution. We prove that the proposed estimator
admits an \textit{affine-invariant} error bound holding with high probability,
stated in terms of the unknown covariance matrix and the positive semidefinite
order on symmetric matrices. The result only requires the existence of
fourth-order moments, and the bound is controlled by a measure of kurtosis of
the distribution, the dimensionality of the space, the sample size, and the
desired confidence level. More generally, we can allow for regularization, in
which case the dimensionality gets replaced with the degrees-of-freedom number.
The computational cost of the novel estimator, which depends on the condition
number of the covariance matrix, is comparable to the cost of the sample
covariance estimator in the statistically interesting regime.
We consider applications of our estimator to eigenvalue estimation with
relative error, and to ridge regression with heavy-tailed random design.
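The motivation above can be made concrete with a small simulation (this is an illustration of why heavy tails hurt the plain sample covariance, not the paper's estimator; all sizes and the Student-t degrees of freedom are illustrative choices):

```python
# Illustration (not the paper's estimator): the empirical covariance of
# heavy-tailed data fluctuates more than for Gaussian data with the same
# covariance, motivating robust estimators needing only fourth moments.
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 2000
cov_true = np.eye(d)

# Gaussian sample vs. Student-t sample (df=4.5: fourth moments exist, heavy tails).
gauss = rng.standard_normal((n, d))
df = 4.5
t_sample = rng.standard_t(df, size=(n, d)) * np.sqrt((df - 2) / df)  # unit variance

err_gauss = np.linalg.norm(gauss.T @ gauss / n - cov_true, 2)
err_heavy = np.linalg.norm(t_sample.T @ t_sample / n - cov_true, 2)
print(f"operator-norm error, Gaussian data:     {err_gauss:.3f}")
print(f"operator-norm error, heavy-tailed data: {err_heavy:.3f}")
```

Both samples have identity covariance, so any difference in estimation error is due purely to the tails.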
FALKON: An Optimal Large Scale Kernel Method
Kernel methods provide a principled way to perform nonlinear, nonparametric
learning. They rely on solid functional analytic foundations and enjoy optimal
statistical properties. However, at least in their basic form, they have
limited applicability in large scale scenarios because of stringent
computational requirements in terms of time and especially memory. In this
paper, we take a substantial step in scaling up kernel methods, proposing
FALKON, a novel algorithm that can efficiently process millions of
points. FALKON is derived by combining several algorithmic principles, namely
stochastic subsampling, iterative solvers and preconditioning. Our theoretical
analysis shows that optimal statistical accuracy is achieved requiring
essentially $O(n)$ memory and $O(n\sqrt{n})$ time. An extensive experimental
analysis on large scale datasets shows that, even with a single machine, FALKON
outperforms previous state of the art solutions, which exploit
parallel/distributed architectures.
Comment: NIPS 2017
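A minimal sketch of two of the three principles named above (Nystrom-style subsampling and a linear solve for kernel ridge regression) is below; the actual FALKON algorithm adds a carefully designed preconditioner with an iterative solver, which this toy version omits, and every size and parameter here is an illustrative assumption:

```python
# Simplified Nystrom kernel ridge regression: subsample m centers, then solve
# the reduced m x m system instead of the full n x n one.
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2 * sigma**2))

rng = np.random.default_rng(0)
n, m, lam = 1000, 50, 1e-3            # n points, m Nystrom centers (toy sizes)
X = rng.uniform(-3, 3, (n, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(n)

centers = X[rng.choice(n, m, replace=False)]
Knm = gaussian_kernel(X, centers)
Kmm = gaussian_kernel(centers, centers)

# Nystrom KRR normal equations: (Knm^T Knm + n*lam*Kmm) alpha = Knm^T y.
A = Knm.T @ Knm + n * lam * Kmm
alpha = np.linalg.solve(A + 1e-10 * np.eye(m), Knm.T @ y)  # direct solve stands in
                                                           # for the iterative solver
mse = np.mean((Knm @ alpha - y) ** 2)
print("training MSE:", mse)
```

The point of the reduction is cost: the solve involves an $m \times m$ system with $m \ll n$, which is where the memory savings come from.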
Learning with SGD and Random Features
Sketching and stochastic gradient methods are arguably the most common
techniques to derive efficient large scale learning algorithms. In this paper,
we investigate their application in the context of nonparametric statistical
learning. More precisely, we study the estimator defined by stochastic gradient
with mini batches and random features. The latter can be seen as a form of
nonlinear sketching and used to define approximate kernel methods. The
considered estimator is not explicitly penalized/constrained and regularization
is implicit. Indeed, our study highlights how different parameters, such as the
number of features, iterations, step-size and mini-batch size, control the
learning properties of the solutions. We do this by deriving optimal finite
sample bounds, under standard assumptions. The obtained results are
corroborated and illustrated by numerical experiments.
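The estimator described above can be sketched as follows (random Fourier features approximating a Gaussian kernel, trained by mini-batch SGD with no explicit penalty; the feature count, step size, batch size and iteration count below are illustrative knobs, not the paper's prescriptions):

```python
# Random features + mini-batch SGD: the number of features, iterations,
# step size and batch size act as implicit regularization parameters.
import numpy as np

rng = np.random.default_rng(0)
n, d, M = 2000, 1, 200                # samples, input dim, random features
X = rng.uniform(-3, 3, (n, d))
y = np.cos(X[:, 0]) + 0.1 * rng.standard_normal(n)

W = rng.standard_normal((d, M))       # frequencies for a unit-bandwidth kernel
b = rng.uniform(0, 2 * np.pi, M)
phi = lambda Z: np.sqrt(2.0 / M) * np.cos(Z @ W + b)

w = np.zeros(M)
step, batch = 0.5, 32
for t in range(2000):                 # mini-batch SGD on the squared loss
    idx = rng.choice(n, batch, replace=False)
    Phi = phi(X[idx])
    w -= step * (Phi.T @ (Phi @ w - y[idx]) / batch)

mse = np.mean((phi(X) @ w - y) ** 2)
print("training MSE:", mse)
```

No penalty term appears anywhere; stopping after a fixed number of iterations with a finite number of features is the only regularization.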
A Consistent Regularization Approach for Structured Prediction
We propose and analyze a regularization approach for structured prediction
problems. We characterize a large class of loss functions that allow structured
outputs to be naturally embedded in a linear space. We exploit this fact to
design learning algorithms using a surrogate loss approach and regularization
techniques. We prove universal consistency and finite sample bounds
characterizing the generalization properties of the proposed methods.
Experimental results are provided to demonstrate the practical usefulness of
the proposed approach.
Comment: 39 pages, 2 Tables, 1 Figure
On the Sample Complexity of Subspace Learning
A large number of algorithms in machine learning, from principal component
analysis (PCA), and its non-linear (kernel) extensions, to more recent spectral
embedding and support estimation methods, rely on estimating a linear subspace
from samples. In this paper we introduce a general formulation of this problem
and derive novel learning error estimates. Our results rely on natural
assumptions on the spectral properties of the covariance operator associated to
the data distribution, and hold for a wide class of metrics between
subspaces. As special cases, we discuss sharp error estimates for the
reconstruction properties of PCA and spectral support estimation. Key to our
analysis is an operator theoretic approach that has broad applicability to
spectral learning methods.
Comment: Extended version of conference paper
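The basic problem statement above can be sketched numerically: estimate a linear subspace from samples via PCA and measure a distance between the true and estimated subspaces (here the operator norm between orthogonal projections, one of the metrics covered by such analyses; dimensions, noise level and sample sizes are illustrative):

```python
# Subspace learning sketch: PCA estimate of a k-dimensional subspace,
# error measured as ||P_true - P_hat|| in operator norm.
import numpy as np

rng = np.random.default_rng(0)
d, k = 20, 3
U = np.linalg.qr(rng.standard_normal((d, k)))[0]     # true subspace basis

def subspace_error(n):
    # Data on the subspace (std 3 per direction) plus isotropic noise (std 0.2).
    X = rng.standard_normal((n, k)) @ U.T * 3.0 + 0.2 * rng.standard_normal((n, d))
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    Uh = Vt[:k].T                                    # estimated basis (top-k PCA)
    return np.linalg.norm(U @ U.T - Uh @ Uh.T, 2)

errs = {n: subspace_error(n) for n in (50, 500, 5000)}
for n, e in errs.items():
    print(f"n={n:5d}  projection distance: {e:.3f}")
```

The projection distance lies in [0, 1] by construction, which makes it a convenient scale-free metric between subspaces.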
Statistical Optimality of Stochastic Gradient Descent on Hard Learning Problems through Multiple Passes
We consider stochastic gradient descent (SGD) for least-squares regression
with potentially several passes over the data. While several passes have been
widely reported to perform practically better in terms of predictive
performance on unseen data, the existing theoretical analysis of SGD suggests
that a single pass is statistically optimal. While this is true for
low-dimensional easy problems, we show that for hard problems, multiple passes
lead to statistically optimal predictions while a single pass does not; we also
show that in these hard models, the optimal number of passes over the data
increases with sample size. In order to define the notion of hardness and show
that our predictive performances are optimal, we consider potentially
infinite-dimensional models and notions typically associated to kernel methods,
namely, the decay of eigenvalues of the covariance matrix of the features and
the complexity of the optimal predictor as measured through the covariance
matrix. We illustrate our results on synthetic experiments with non-linear
kernel methods and on a classical benchmark with a linear model.
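The experimental setup can be sketched as follows: single-pass versus multi-pass SGD for least squares, comparing test error. The covariance decay, step size and all sizes below are illustrative choices standing in for the paper's notions of eigenvalue decay and predictor complexity, not its exact settings:

```python
# Multi-pass SGD for least squares with a slowly decaying feature covariance.
import numpy as np

rng = np.random.default_rng(0)
d, n, n_test = 100, 500, 2000
decay = np.arange(1, d + 1) ** -1.0           # slowly decaying eigenvalues
w_star = rng.standard_normal(d) / np.arange(1, d + 1)

def sample(m):
    X = rng.standard_normal((m, d)) * np.sqrt(decay)
    return X, X @ w_star + 0.1 * rng.standard_normal(m)

X, y = sample(n)
Xt, yt = sample(n_test)

def sgd(passes, step=0.05):
    w = np.zeros(d)
    for _ in range(passes):
        for i in rng.permutation(n):          # one pass = one shuffled epoch
            w -= step * (X[i] @ w - y[i]) * X[i]
    return np.mean((Xt @ w - yt) ** 2)        # test MSE

results = {p: sgd(p) for p in (1, 5, 20)}
for p, mse in results.items():
    print(f"passes={p:2d}  test MSE: {mse:.4f}")
```

Whether extra passes help here depends on how slowly `decay` falls off relative to the complexity of `w_star`, which mirrors the paper's notion of problem hardness.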
Exponential convergence of testing error for stochastic gradient methods
We consider binary classification problems with positive definite kernels and
square loss, and study the convergence rates of stochastic gradient methods. We
show that while the excess testing loss (squared loss) converges slowly to zero
as the number of observations (and thus iterations) goes to infinity, the
testing error (classification error) converges exponentially fast if low-noise
conditions are assumed.
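The phenomenon can be illustrated on a toy linearly separable problem (the margin construction, model and all parameters are illustrative, not the paper's kernel setting): the classification error of the SGD iterate reaches zero long before the squared loss has converged, because the squared loss of the best linear predictor of a sign label is bounded away from zero:

```python
# SGD on the squared loss for binary classification with a hard margin:
# the classification error vanishes while the squared loss plateaus.
import numpy as np

rng = np.random.default_rng(0)

def make(m):
    X = rng.standard_normal((3 * m, 2))
    X = X[np.abs(X[:, 0]) > 0.3][:m]          # enforce a margin (low noise)
    return X, np.sign(X[:, 0])

X, y = make(500)
Xt, yt = make(2000)

w = np.zeros(2)
step = 0.05
for t in range(1, 2001):                      # SGD on the squared loss
    i = rng.integers(len(y))
    w -= step * (X[i] @ w - y[i]) * X[i]
    if t in (100, 500, 2000):
        loss = np.mean((Xt @ w - yt) ** 2)
        err = np.mean(np.sign(Xt @ w) != yt)
        print(f"iter={t:4d}  squared loss={loss:.3f}  classification error={err:.3f}")
```

The squared loss stays bounded away from zero (no linear function fits a sign label exactly), yet under the margin condition the sign of the iterate agrees with the label everywhere.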